[BugFix] Fix Ascend MoE routing expert count with EPLB #8864
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a bug in Ascend MoE routing where the dynamic EPLB configuration caused a mismatch between logical and physical expert counts. By correctly separating these counts and updating the quantization paths, the fix ensures that router logits and expert selection logic operate on the expected logical expert count, preventing assertion failures in distributed MoE scenarios.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
Suggested PR Title:

[Ops][Feature] Refactor MoE expert logic and unify SharedFusedMoE

Suggested PR Summary:

### What this PR does / why we need it?
This PR refactors the MoE implementation for Ascend by introducing a centralized `get_moe_num_logical_experts` utility to handle expert counts across various quantization methods (W4A16, W4A4, W8A8). It unifies `AscendSharedFusedMoE` into `AscendFusedMoE`, updates the runner to inherit from the standard `MoERunner`, and adds a consistency validation check for shared expert split computations. Feedback was provided regarding unresolved developer notes in Chinese and hardcoded logic in the `finalize` call within `fused_moe.py`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit tests were added in `tests/ut/quantization/methods/test_moe_logical_experts.py` to verify the logical expert calculation.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
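As a rough illustration of what such a unit test might cover, here is a minimal sketch; the helper body, field names, and `SimpleNamespace` configs below are illustrative assumptions, not the actual vllm-ascend code:

```python
from types import SimpleNamespace


def get_moe_num_logical_experts(moe_config) -> int:
    """Sketch of a logical-expert-count resolver: prefer the newer
    ``num_logical_experts`` field, falling back to ``num_experts`` for
    older configs that predate the logical/physical split."""
    num_logical = getattr(moe_config, "num_logical_experts", None)
    return num_logical if num_logical is not None else moe_config.num_experts


def test_logical_experts_new_config():
    # With dynamic EPLB, num_experts counts physical experts
    # (logical + redundant replicas); the helper must return 128.
    cfg = SimpleNamespace(num_experts=132, num_logical_experts=128)
    assert get_moe_num_logical_experts(cfg) == 128


def test_logical_experts_fallback():
    # Older configs expose only num_experts.
    cfg = SimpleNamespace(num_experts=64)
    assert get_moe_num_logical_experts(cfg) == 64
```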
The CI error is unrelated to this PR. The bug comes from #8831, which was not adapted to vLLM main; vllm-project/vllm#39446 broke it.
[BugFix] Fix Ascend MoE routing expert count with EPLB (#8864)

## Summary

Fix Ascend MoE dynamic EPLB routing after the upstream vLLM MoE/EPLB refactor.

Upstream vLLM now distinguishes:

- logical experts: the experts represented by router logits
- physical/global experts: logical experts plus redundant EPLB replicas

`router_logits.shape[-1]` matches the logical expert count, but Ascend MoE quant paths were comparing it against `moe_config.num_experts`, which can include redundant physical experts when dynamic EPLB is enabled. This caused:

    AssertionError: Number of global experts mismatch (excluding redundancy)

in the Qwen3 MoE W8A8 dynamic EPLB TP2 test.

## Changes

- Add a helper to resolve the logical expert count from `moe_config.num_logical_experts`, with a fallback for older configs.
- Use the logical expert count for:
  - router logits validation
  - expert selection
  - zero expert handling
  - profile force-load-balance random routing
- Preserve the physical/global expert count for dispatch and redundant expert handling.
- Apply the same logical/physical split to related Ascend MoE quant paths to avoid the same bug in other quant modes.

## Root cause

vLLM upstream PRs such as [#30623](vllm-project/vllm#30623) separated router logic into dedicated router classes and made EPLB map logical expert IDs to physical expert IDs after top-k selection. Ascend code still treated `moe_config.num_experts` as the router-logits expert count, but with dynamic EPLB it represents physical/global experts.
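The root-cause description can be illustrated with a pure-Python sketch of the post-refactor routing flow (the function and the replica-table shape are illustrative assumptions, not vLLM's actual API): top-k selection runs in logical expert space, and only afterwards does EPLB map each selected logical ID to one of its physical replicas.

```python
import random


def route_tokens(router_logits_row, logical_to_physical, top_k=2):
    """Pick top-k LOGICAL experts, then map each to a PHYSICAL replica.

    router_logits_row: per-token scores, len == num_logical_experts
    logical_to_physical: one list of physical replica IDs per logical
    expert (EPLB may give a hot expert several physical copies).
    """
    # Top-k selection happens in LOGICAL expert space: the logits
    # tensor is sized to the logical expert count, not the physical one.
    topk_logical = sorted(range(len(router_logits_row)),
                          key=lambda e: router_logits_row[e],
                          reverse=True)[:top_k]
    # Only afterwards does EPLB map each logical ID to one of its
    # physical replicas for dispatch.
    return [random.choice(logical_to_physical[e]) for e in topk_logical]


# 4 logical experts, each with exactly one physical replica:
# top-2 of [0.1, 0.9, 0.5, 0.2] selects logical experts 1 and 2.
print(route_tokens([0.1, 0.9, 0.5, 0.2], [[0], [1], [2], [3]]))  # -> [1, 2]
```

Comparing `len(router_logits_row)` against the total number of physical replicas here would fail exactly the way the assertion above does whenever any logical expert has more than one replica.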
## Test

```
VLLM_USE_MODELSCOPE=True pytest -sv \
    tests/e2e/multicard/2-cards/test_qwen3_moe.py::test_qwen3_moe_w8a8_distributed_tp2_ep_dynamic_eplb
```

## Test Result

```
(APIServer pid=328985) INFO:     127.0.0.1:60556 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-05-02 10:28:33.692579][UC][I] Shutdown initiated (timeout=0) [329130,329130][core.py:1238,_handle_shutdown]
[2026-05-02 10:28:33.692621][UC][I] Shutdown complete [329130,329130][core.py:1261,_handle_shutdown]
[2026-05-02 10:28:33.692798][UC][I] Parent process exited, terminating worker queues [329270,330271][multiproc_executor.py:775,death_pipe_monitor]
[2026-05-02 10:28:33.692812][UC][I] Parent process exited, terminating worker queues [329278,330263][multiproc_executor.py:775,death_pipe_monitor]
[2026-05-02 10:28:33.692919][UC][I] WorkerProc shutting down. [329270,329270][multiproc_executor.py:872,worker_main]
[2026-05-02 10:28:33.692960][UC][I] WorkerProc shutting down. [329278,329278][multiproc_executor.py:872,worker_main]
(APIServer pid=328985) INFO:     Shutting down
(APIServer pid=328985) INFO:     Waiting for application shutdown.
(APIServer pid=328985) INFO:     Application shutdown complete.
(APIServer pid=328985) INFO:     Finished server process [328985]
(APIServer pid=328985) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
/usr/local/python3.11.14/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
PASSED
========================================================== warnings summary ===========================================================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
vllm_ascend/patch/worker/patch_weight_utils.py:80
  /root/vllm-workspace2/vllm-ascend/vllm_ascend/patch/worker/patch_weight_utils.py:80: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    module = original_import(name, globals, locals, fromlist, level)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================== 1 passed, 3 warnings in 342.04s (0:05:42) ==============================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
```

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>